library(tidyverse)
library(gapminder)Introduction to Tidyverse
tidyverse
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
For most applied researchers, data preparation usually involves 3 main steps.
- Transforming data frames, e.g. filtering, summarizing, and conducting calculations across groups.
- Tidying data into the appropriate format.
- Merging or linking several datasets to create a bigger dataset.
The tidyverse is a suite of packages designed specifically to help with these steps. These are by no means the only packages out there for data wrangling, but they are increasingly popular for their readable, straightforward syntax and sensible default behaviors.
In this chapter, we’re going to focus on how to use the dplyr package for data transformation tasks.
Gapminder
For this unit, we'll be working with the "Gapminder" dataset again.
gap <- gapminder
kable(head(gap))| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.8 | 8425333 | 779 |
| Afghanistan | Asia | 1957 | 30.3 | 9240934 | 821 |
| Afghanistan | Asia | 1962 | 32.0 | 10267083 | 853 |
| Afghanistan | Asia | 1967 | 34.0 | 11537966 | 836 |
| Afghanistan | Asia | 1972 | 36.1 | 13079460 | 740 |
| Afghanistan | Asia | 1977 | 38.4 | 14880372 | 786 |
Why dplyr?
So far, you’ve seen the basics of manipulating data frames, e.g. subsetting and basic calculations. For instance, we can use base R functions to calculate summary statistics across groups of observations, e.g., the mean GDP per capita within each region:
mean(gap$gdpPercap[gap$continent == "Africa"])
#> [1] 2194
mean(gap$gdpPercap[gap$continent == "Americas"])
#> [1] 7136
mean(gap$gdpPercap[gap$continent == "Asia"])
#> [1] 7902But this isn't ideal because it involves a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition. As an added bonus, you might even find the dplyr grammar easier to read.
Here, we're going to cover 7 of the most commonly used dplyr functions. We'll also cover pipes (%>%), which are used to combine them.
select()filter()group_by()summarize()mutate()arrange()count()
If you have have not installed tidyverse, please do so now:
# not run
# install.packages('tidyverse')
require(tidyverse)dplyr Functions
Select Columns with select
Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.
year_country_gdp <- select(gap, year, country, gdpPercap)
kable(head(year_country_gdp))| year | country | gdpPercap |
|---|---|---|
| 1952 | Afghanistan | 779 |
| 1957 | Afghanistan | 821 |
| 1962 | Afghanistan | 853 |
| 1967 | Afghanistan | 836 |
| 1972 | Afghanistan | 740 |
| 1977 | Afghanistan | 786 |
If we open up year_country_gdp, we'll see that it only contains the year, country and gdpPercap. This is equivalent to the base R subsetting function:
year_country_gdp_base <- gap[,c("year", "country", "gdpPercap")]
kable(head(year_country_gdp))| year | country | gdpPercap |
|---|---|---|
| 1952 | Afghanistan | 779 |
| 1957 | Afghanistan | 821 |
| 1962 | Afghanistan | 853 |
| 1967 | Afghanistan | 836 |
| 1972 | Afghanistan | 740 |
| 1977 | Afghanistan | 786 |
We can also use select to rename columns:
year_country_gdp_named <- select(gap, Year = year, Country = country, GDP_per_capita = gdpPercap)But, as we will see, dplyr makes for much more readable, efficient code because of its pipe operator.
The Pipe
Above, we used what's called 'normal' grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes.
In typical base R code, a simple operation might be written like:
# NOT run
cupcakes <- bake(pour(mix(ingredients)))A computer has no trouble understanding this and your cupcakes will be made just fine, but a person has to read right to left to understand the order of operations - the opposite of how most western languages are read - making it harder to understand what is being done!
To be more readable without pipes, we might break up this code into intermediate objects:
## NOT run
batter <- mix(ingredients)
muffin_tin <- pour(batter)
cupcakes <- bake(muffin_tin)But, this can clutter our environment with a lot of variables that aren't very useful to us. Plus, these variables are often are named very similar things (e.g. step, step1, step2...) which can lead to confusion and the creation of hard-to-track-down bugs.
Enter the Pipe...
The pipe makes it easier to read code by laying out operations from left to right -- each line can be read like a line of a recipe for the perfect data frame!
Pipes take the input on the left side of the %>% symbol and pass it in as the first argument to the function on the right side.
With pipes, our cupcake example might be written like:
## NOT run
cupcakes <- ingredients %>%
mix() %>%
pour() %>%
bake()Tips for Piping
Remember that you don't assign anything within the pipes -- that is, you should not use <- inside the piped operation. Only use this at the beginning of your code if you want to save the output.
Remember to add the pipe
%>%at the end of each line involved in the piped operation. A good rule of thumb: since RStudio will automatically indent lines of code that are part of a piped operation, if the line isn’t indented, it probably hasn't been added to the pipe. If you have an error in a piped operation, always check to make sure the pipe is connected as you expect.In RStudio, the hotkey for the pipe is Ctrl + Shift + M.
select & Pipe (%>%)
Since the pipe grammar is unlike anything we've seen in R before, let's repeat what we did above with the gapminder dataset using pipes:
year_country_gdp <- gap %>% select(year, country, gdpPercap)Let's walk through it step by step.
First, we summon the gapminder data frame and pass it on to the next step using the pipe symbol %>%.
The second step is the select() function. In this case, we don't specify which data object we use in the call to select() since we've piped it in from the previous line.
Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is |. But the concept is the same!
Filter Rows with filter
Now let's say we're only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.
year_country_gdp_africa <- gap %>%
filter(continent == "Africa") %>%
select(year, country, gdpPercap)As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder dataframe to the select() function.
To clarify, both the select and filter functions subset the data frame. The difference is that select extracts certain columns, while filter extracts certain rows.
Note: The order of operations is very important in this case. If we used select first, filter would not be able to find the variable continent since we would have removed it in the previous step.
Calculate Across Groups with group_by
A common task you'll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculate the mean GDP per capita for each continent?
In base R, you would have to run the mean() function for each subset of data:
mean(gap$gdpPercap[gap$continent == "Africa"])
#> [1] 2194
mean(gap$gdpPercap[gap$continent == "Americas"])
#> [1] 7136
mean(gap$gdpPercap[gap$continent == "Asia"])
#> [1] 7902
mean(gap$gdpPercap[gap$continent == "Europe"])
#> [1] 14469
mean(gap$gdpPercap[gap$continent == "Oceania"])
#> [1] 18622That's a lot of repetition! To make matters worse, what if we wanted to add these values to our original data frame as a new column? We would have to write something like this:
gap$mean.continent.GDP <- NA
gap$mean.continent.GDP[gap$continent == "Africa"] <- mean(gap$gdpPercap[gap$continent == "Africa"])
gap$mean.continent.GDP[gap$continent == "Americas"] <- mean(gap$gdpPercap[gap$continent == "Americas"])
gap$mean.continent.GDP[gap$continent == "Asia"] <- mean(gap$gdpPercap[gap$continent == "Asia"])
gap$mean.continent.GDP[gap$continent == "Europe"] <- mean(gap$gdpPercap[gap$continent == "Europe"])
gap$mean.continent.GDP[gap$continent == "Oceania"] <- mean(gap$gdpPercap[gap$continent == "Oceania"])You can see how this can get pretty tedious, especially if we want to calculate more complicated or refined statistics. We could use loops or apply functions, but these can be difficult, slow, and error-prone.
split-apply-combine
The abstract problem we're encountering here is know as "split-apply-combine":
We want to split our data into groups (in this case continents), apply some calculations on that group, then combine the results together afterwards.
Luckily, dplyr offers a much cleaner, straight-forward solution to this problem.
First, let's remove the column we just made.
gap <- gap %>% select(-mean.continent.GDP) # drop a column with -
# OR
gap$mean.continent.GDP <- NULLgroup_by
We've already seen how filter() can help us select observations that meet certain criteria (in the above: continent == "Africa"). More helpful, however, is the group_by() function, which will essentially use every unique criteria that we could have used in filter().
A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value for continent (at least in the example above).
Summarize Across Groups with summarize
group_by() on its own is not particularly interesting. It's much more exciting used in conjunction with the summarize() function.
This will allow us to create new variable(s) by applying transformations to variables in each of our groups (continent-specific data frames).
In other words, using the group_by() function, we split our original data frame into multiple pieces, to which we then apply summary functions (e.g. mean() or sd()) within summarize().
The output is a new data frame reduced in size, with one row per group.
gdp_bycontinents <- gap %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
kable(head(gdp_bycontinents))| continent | mean_gdpPercap |
|---|---|
| Africa | 2194 |
| Americas | 7136 |
| Asia | 7902 |
| Europe | 14469 |
| Oceania | 18622 |
That allowed us to calculate the mean gdpPercap for each continent.
But it gets even better -- the function group_by() allows us to group by multiple variables. Let's group by year and continent.
gdp_bycontinents_byyear <- gap %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
kable(head(gdp_bycontinents_byyear))| continent | year | mean_gdpPercap |
|---|---|---|
| Africa | 1952 | 1253 |
| Africa | 1957 | 1385 |
| Africa | 1962 | 1598 |
| Africa | 1967 | 2050 |
| Africa | 1972 | 2340 |
| Africa | 1977 | 2586 |
That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in summarize().
gdp_pop_bycontinents_byyear <- gap %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
kable(head(gdp_pop_bycontinents_byyear))| continent | year | mean_gdpPercap | sd_gdpPercap | mean_pop | sd_pop |
|---|---|---|---|---|---|
| Africa | 1952 | 1253 | 983 | 4570010 | 6317450 |
| Africa | 1957 | 1385 | 1135 | 5093033 | 7076042 |
| Africa | 1962 | 1598 | 1462 | 5702247 | 7957545 |
| Africa | 1967 | 2050 | 2848 | 6447875 | 8985505 |
| Africa | 1972 | 2340 | 3287 | 7305376 | 10130833 |
| Africa | 1977 | 2586 | 4142 | 8328097 | 11585184 |
Add New Variables with mutate
What if we wanted to add these values to our original data frame instead of creating a new object?
For this, we can use the mutate() function, which is similar to summarize() except that it creates new variables in the same dataframe that you pass into it.
gapminder_with_extra_vars <- gap %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
kable(head(gapminder_with_extra_vars))| country | continent | year | lifeExp | pop | gdpPercap | mean_gdpPercap | sd_gdpPercap | mean_pop | sd_pop |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.8 | 8425333 | 779 | 5195 | 18635 | 42283556 | 1.13e+08 |
| Afghanistan | Asia | 1957 | 30.3 | 9240934 | 821 | 5788 | 19507 | 47356988 | 1.28e+08 |
| Afghanistan | Asia | 1962 | 32.0 | 10267083 | 853 | 5729 | 16416 | 51404763 | 1.36e+08 |
| Afghanistan | Asia | 1967 | 34.0 | 11537966 | 836 | 5971 | 14063 | 57747361 | 1.53e+08 |
| Afghanistan | Asia | 1972 | 36.1 | 13079460 | 740 | 8187 | 19088 | 65180977 | 1.74e+08 |
| Afghanistan | Asia | 1977 | 38.4 | 14880372 | 786 | 7791 | 11816 | 72257987 | 1.92e+08 |
We can use also use mutate() to create new variables prior to (or even after) summarizing the information.
gdp_pop_bycontinents_byyear <- gap %>%
mutate(gdp_billion = gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
kable(head(gdp_pop_bycontinents_byyear))| continent | year | mean_gdpPercap | sd_gdpPercap | mean_pop | sd_pop | mean_gdp_billion | sd_gdp_billion |
|---|---|---|---|---|---|---|---|
| Africa | 1952 | 1253 | 983 | 4570010 | 6317450 | 5.99 | 11.4 |
| Africa | 1957 | 1385 | 1135 | 5093033 | 7076042 | 7.36 | 14.5 |
| Africa | 1962 | 1598 | 1462 | 5702247 | 7957545 | 8.79 | 17.2 |
| Africa | 1967 | 2050 | 2848 | 6447875 | 8985505 | 11.44 | 23.2 |
| Africa | 1972 | 2340 | 3287 | 7305376 | 10130833 | 15.07 | 30.4 |
| Africa | 1977 | 2586 | 4142 | 8328097 | 11585184 | 18.70 | 38.1 |
mutate vs. summarize
It can be confusing to decide whether to use mutate or summarize. The key distinction is whether you want the output to have one row for each group or one row for each row in the original data frame:
mutate: creates new columns with as many rows as the original data framesummarize: creates a data frame with as many rows as groups
Note that if you use an aggregation function such as mean() within mutate() without using group_by(), you'll simply do the summary over all the rows of the input data frame.
And if you use an aggregation function such as mean() within summarize() without using group_by(), you'll simply create an output data frame with one row (i.e., the whole input data frame is a single group).
Arrange Rows with arrange
Let's say we want to sort the rows in our data frame according to values in a certain column. We can use the arrange() function to do this. For instance, let's organize our rows by year (recent first), and then by continent.
gapminder_with_extra_vars <- gap %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop)) %>%
arrange(desc(year), continent)
kable(head(gapminder_with_extra_vars))| country | continent | year | lifeExp | pop | gdpPercap | mean_gdpPercap | sd_gdpPercap | mean_pop | sd_pop |
|---|---|---|---|---|---|---|---|---|---|
| Algeria | Africa | 2007 | 72.3 | 33333216 | 6223 | 3089 | 3618 | 17875763 | 24917726 |
| Angola | Africa | 2007 | 42.7 | 12420476 | 4797 | 3089 | 3618 | 17875763 | 24917726 |
| Benin | Africa | 2007 | 56.7 | 8078314 | 1441 | 3089 | 3618 | 17875763 | 24917726 |
| Botswana | Africa | 2007 | 50.7 | 1639131 | 12570 | 3089 | 3618 | 17875763 | 24917726 |
| Burkina Faso | Africa | 2007 | 52.3 | 14326203 | 1217 | 3089 | 3618 | 17875763 | 24917726 |
| Burundi | Africa | 2007 | 49.6 | 8390505 | 430 | 3089 | 3618 | 17875763 | 24917726 |
Count Variable Quantities with count()
Finally, let's say we want to examine if thenumber of countries covered in the gapminder data set varies between years. We can use count() to count the number of observations within a set of parameters we choose.
Below, we will specify that we want to count() the number of observations in each year of the data set.
gap_count <- gap %>%
count(year)
kable(head(gap_count))| year | n |
|---|---|
| 1952 | 142 |
| 1957 | 142 |
| 1962 | 142 |
| 1967 | 142 |
| 1972 | 142 |
| 1977 | 142 |
We can confirm that each year in the data set contains the same number of observations. We can use similar syntax to answer other questions: for example, how many countries in each year have a GDP that is greater than $10,000 per capita?
gap_count_gdp <- gap %>%
filter(gdpPercap >= 10000) %>%
count(year)
kable(head(gap_count_gdp))| year | n |
|---|---|
| 1952 | 7 |
| 1957 | 12 |
| 1962 | 19 |
| 1967 | 22 |
| 1972 | 32 |
| 1977 | 41 |
Challenges
Challenge 1.
Use dplyr to create a data frame containing the median lifeExp for each continent.
Challenge 2.
Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.
Challenge 3.
Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation's gdpPercap and the mean gdpPercap of the continent in that year, (b) arrange the dataframe by the column you just created, in descending order (so that the relatively richest country-years are listed first).
hint: You might have to ungroup() before you arrange().
Acknowledgments
Some of these materials in this module were adapted from: